quadratic activation function
Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions
We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks with a quadratic activation function in the overparametrized regime, where the layer width m is larger than the input dimension d. We consider a teacher-student scenario in which the teacher has the same structure as the student but a hidden layer of smaller width m* <= m. We describe how the empirical loss landscape is affected by the number n of data samples and the width m* of the teacher network. In particular, we determine how the probability that there are no spurious minima of the empirical loss depends on n, d, and m*, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e., it enables recovery in practice. Finally, we characterize the convergence rate of gradient descent in time in the limit of a large number of samples. These results are confirmed by numerical experiments.
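The setup above lends itself to a short numerical illustration. The sketch below is an assumed, minimal version of the teacher-student experiment (not the authors' exact code or hyperparameters): a quadratic-activation teacher of width m*, an overparametrized student of width m > d, n Gaussian samples, and full-batch gradient descent on the empirical squared loss.

```python
# Minimal sketch, assuming the teacher-student setup described in the abstract:
# f_W(x) = sum_i (w_i^T x)^2 = x^T W^T W x, with an overparametrized student
# (m > d) trained by full-batch gradient descent on the empirical squared loss.
# All hyperparameters below are illustrative, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
d, m_star, m, n = 8, 2, 16, 400        # input dim, teacher width, student width, samples
lr, steps = 2e-3, 10000

W_teacher = rng.standard_normal((m_star, d))    # teacher weights, width m* <= m
W = rng.standard_normal((m, d)) / np.sqrt(m)    # student initialization

X = rng.standard_normal((n, d))                 # Gaussian inputs
y = np.sum((X @ W_teacher.T) ** 2, axis=1)      # teacher labels

for _ in range(steps):
    pred = np.sum((X @ W.T) ** 2, axis=1)       # student output x^T W^T W x
    resid = pred - y
    # gradient of (1/2n) sum_k resid_k^2 w.r.t. W is (2/n) * W * sum_k resid_k x_k x_k^T
    grad = (2.0 / n) * (W @ ((X.T * resid) @ X))
    W -= lr * grad

train_loss = 0.5 * np.mean((np.sum((X @ W.T) ** 2, axis=1) - y) ** 2)
X_test = rng.standard_normal((5 * n, d))        # fresh samples to estimate generalization error
gen_err = 0.5 * np.mean((np.sum((X_test @ W.T) ** 2, axis=1)
                         - np.sum((X_test @ W_teacher.T) ** 2, axis=1)) ** 2)
print(f"train loss {train_loss:.3e}, generalization error {gen_err:.3e}")
```

Varying n, d, and m* in such a sketch is a quick way to probe when recovery (small generalization error) actually happens.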
Review for NeurIPS paper: Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions
Reviews for this paper are mixed; in particular, some reviewers were concerned about missing proofs. On the other hand, the paper studies an important problem and carries out a nice analysis that integrates numerical experiments, heuristic derivations, and rigorous proofs in a meaningful way, and the reader learns a lot about such models (quadratic two-layer networks with a sparse teacher). It is thus necessary that the authors put substantial effort into writing the missing proofs thoroughly, because it will not be possible to review those proofs again (and of course all the other changes proposed in the rebuttal should be implemented). Overall, for a paper that contains true statements, conjectures, and heuristics, it is very important to make the "truth status" of each statement explicit, and "true statements" should have a proof.
Review for NeurIPS paper: Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions
For random initialization, I also believe the analysis still needs substantial work. The upper bound on E(A(t)) clearly depends on the condition number of A(0), rather than simply on whether A(0) is full-rank or rank-deficient. Moreover, rather than focusing only on the full-rank case, the authors could treat the problem uniformly and continuously; for example, the Marchenko-Pastur law from random matrix theory may provide an asymptotic analysis of random initialization, since the limiting distribution of the eigenvalues is known. A non-asymptotic version may also exist, but it would require additional perturbation bounds. Finally, due to my research background, I had overlooked the development of shallow neural networks with random Gaussian input; I apologize for that and raise my score.
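The Marchenko-Pastur suggestion can be checked numerically. Assuming the common initialization A(0) = W(0)^T W(0) with i.i.d. Gaussian entries of variance 1/m (an assumption; the review does not specify the scaling), the eigenvalues of A(0) concentrate on the Marchenko-Pastur support with ratio d/m, which pins down the condition number that such a bound would depend on:

```python
# Hedged numerical check of the reviewer's point: for W(0) in R^{m x d} with
# i.i.d. N(0, 1/m) entries (assumed scaling), the spectrum of A(0) = W(0)^T W(0)
# follows the Marchenko-Pastur law with ratio lambda = d/m, so the condition
# number of A(0) concentrates around ((1 + sqrt(lambda)) / (1 - sqrt(lambda)))^2.
import numpy as np

rng = np.random.default_rng(1)
d, m, trials = 50, 200, 20
lam = d / m
edge_lo, edge_hi = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2

conds = []
for _ in range(trials):
    W0 = rng.standard_normal((m, d)) / np.sqrt(m)
    eigs = np.linalg.eigvalsh(W0.T @ W0)     # spectrum of A(0), ascending
    conds.append(eigs[-1] / eigs[0])         # condition number of A(0)

print(f"MP support      : [{edge_lo:.3f}, {edge_hi:.3f}]")
print(f"predicted kappa : {edge_hi / edge_lo:.2f}")
print(f"empirical kappa : {np.mean(conds):.2f} +/- {np.std(conds):.2f}")
```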
Nonconvex sparse regularization for deep neural networks and its optimality
Recent theoretical studies have proved that deep neural network (DNN) estimators obtained by minimizing the empirical risk under a certain sparsity constraint can attain optimal convergence rates for regression and classification problems. However, the sparsity constraint requires knowledge of certain properties of the true model, which are not available in practice. Moreover, computation is difficult due to the discrete nature of the sparsity constraint. In this paper, we propose a novel penalized estimation method for sparse DNNs that resolves the aforementioned problems of the sparsity constraint. We establish an oracle inequality for the excess risk of the proposed sparse-penalized DNN estimator and derive convergence rates for several learning tasks. In particular, we prove that the sparse-penalized estimator can adaptively attain minimax convergence rates for various nonparametric regression problems. For computation, we develop an efficient gradient-based optimization algorithm that guarantees the monotonic reduction of the objective function.
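As a rough sketch of the kind of penalty involved (the paper's exact penalty and algorithm are not reproduced here), a clipped-L1 penalty is a standard nonconvex sparsity surrogate that behaves like L1 near zero and stops penalizing large weights; the names and hyperparameters below are illustrative.

```python
# Illustrative nonconvex sparsity penalty (clipped L1), used here only as a
# stand-in for the sparse penalty discussed above: lam * min(|w|, tau) penalizes
# small weights like L1 but does not keep shrinking large weights.
import numpy as np

def clipped_l1(w, lam=1e-2, tau=0.1):
    """Penalty value: lam * sum_j min(|w_j|, tau)."""
    return lam * np.minimum(np.abs(w), tau).sum()

def clipped_l1_grad(w, lam=1e-2, tau=0.1):
    """(Sub)gradient: lam * sign(w_j) where |w_j| <= tau, and 0 where |w_j| > tau."""
    return lam * np.sign(w) * (np.abs(w) <= tau)

# Inside a training loop one would add the penalty to the empirical risk, e.g.
#   loss = empirical_risk(W) + clipped_l1(W)
#   grad = risk_grad(W) + clipped_l1_grad(W)
#   W   -= step_size * grad
w = np.array([-0.5, -0.05, 0.0, 0.02, 0.8])
print(clipped_l1(w), clipped_l1_grad(w))
```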
Smooth function approximation by deep neural networks with general activation functions
There has been growing interest in the expressivity of deep neural networks, but most existing work on this topic focuses only on specific activation functions such as ReLU or sigmoid. In this paper, we investigate the approximation ability of deep neural networks with a quite general class of activation functions, which includes most commonly used activation functions. We derive the depth, width, and sparsity of a deep neural network required to approximate any Hölder smooth function up to a given approximation error for this large class of activation functions. Based on our approximation error analysis, we establish the minimax optimality of deep neural network estimators with general activation functions in both regression and classification problems.
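As a toy illustration of this approximation question (not the constructions or rates from the paper), the sketch below fits the Hölder smooth target f(x) = |x|^{3/2} on [-1, 1] with a small two-hidden-layer network using a non-ReLU activation (softplus); depth, width, and learning rate are arbitrary illustrative choices.

```python
# Hedged toy example: approximate a Hoelder smooth target with a small deep
# network using a general (non-ReLU) activation, here softplus.  Network size
# and training schedule are illustrative, not the paper's constructions.
import numpy as np

rng = np.random.default_rng(2)
softplus = lambda z: np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0)   # numerically stable softplus
dsoftplus = lambda z: 1.0 / (1.0 + np.exp(-z))                          # its derivative (sigmoid)

n, width, lr, steps = 256, 32, 1e-2, 20000
x = np.linspace(-1, 1, n).reshape(-1, 1)
y = np.abs(x) ** 1.5                                                    # Hoelder smooth target

# Two-hidden-layer network: input -> softplus -> softplus -> linear output
W1 = rng.standard_normal((1, width)); b1 = np.zeros(width)
W2 = rng.standard_normal((width, width)) / np.sqrt(width); b2 = np.zeros(width)
W3 = rng.standard_normal((width, 1)) / np.sqrt(width); b3 = np.zeros(1)

for _ in range(steps):
    z1 = x @ W1 + b1;  h1 = softplus(z1)
    z2 = h1 @ W2 + b2; h2 = softplus(z2)
    pred = h2 @ W3 + b3
    g = (pred - y) / n                       # gradient of 0.5 * mean squared error
    gW3 = h2.T @ g;                 gb3 = g.sum(0)
    g2  = (g @ W3.T) * dsoftplus(z2)
    gW2 = h1.T @ g2;                gb2 = g2.sum(0)
    g1  = (g2 @ W2.T) * dsoftplus(z1)
    gW1 = x.T @ g1;                 gb1 = g1.sum(0)
    for P, G in ((W1, gW1), (b1, gb1), (W2, gW2), (b2, gb2), (W3, gW3), (b3, gb3)):
        P -= lr * G                          # in-place gradient descent update

pred = softplus(softplus(x @ W1 + b1) @ W2 + b2) @ W3 + b3
print("sup-norm fit error:", float(np.max(np.abs(pred - y))))
```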